{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/Maplemx/Agently/blob/main/playground/finding_connectable_pairs_from_text_tailers_and_headers.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Finding Connectable Pairs from Text Tailers and Headers"
      ],
      "metadata": {
        "id": "dAzfqHDCAXZe"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Demo Description\n",
        "\n",
        "**Author:** Agently Team\n",
        "\n",
        "**Prompt Language:** English\n",
        "\n",
        "**Show Case Applicable Language:** Chinese or similar ideographic script languages\n",
        "\n",
        "**Agent Components:** None\n",
        "\n",
        "**Description:**\n",
        "\n",
        "This is a fun case I wrote because a developer ask a question in the chat group about how to find tailer-header pairs to connect among many text segments that scaned and recongnized by OCR.\n",
        "\n",
        "I tried this case and used some simple Chinese sentences I made up in my mind to test and it worked well. But when I tried to use some sentences from academic papers and novels in English, its performance was so poor no matter I use GPT-4 or any Chinese model to drive. So I decided this case was failure and this case should not be post.\n",
        "\n",
        "But in the morning of the next day, a thought come across me that maybe it's not about the sentence cases are made up or not, maybe it's about the language differece! As we know, Chinese is an ideographic script language, that means the probability of a single Chinese character or a phrase composed of two or three Chinese characters being fixed in usage is higher than that of phonetic scripts like English. That may cause the different test result effects in this case using sentences in different languages. In this case, what we want to do is so alike the law that how transformer model works. Maybe in Chinese this case will have a good output. I tried that with 5 sentences I picked from Agently document, google colab introduction document and some Chinese novels sentences I found from douban.com and the result seems good using ERNIE(WenXin 4.0) language model. (But GPT still doing not so good)\n",
        "\n",
        "So anyway, I post this case out to show the ideas and hope it can help you in your work.\n",
        "\n",
        "这个案例是一个挺有意思的案例。这是一个讨论群里的开发者提出的业务场景问题，在使用OCR扫描的时候，经常会遇到各种扫描原因的文字断行、分页导致应该连接的文字行被切分到不同文本块里的情况。那么应该怎么找到这些文本块的正确连接匹配对呢？\n",
        "\n",
        "我在写完这个案例的时候，一开始使用的是我自己瞎编的几个中文句子进行测试，发现测试的效果还不错，也把方案反馈给提问的那个开发者了。但是当我想把这个案例发出来的时候，我觉得应该使用更真实的案例。于是我使用了手头上的一篇英文论文进行了测试，发现效果非常不理想。而且无论我切换使用GPT-4(包括稳定版和1106预览版）还是其他支持的中文模型，都无法得到满意的结果。所以我认为这个案例失败了。\n",
        "\n",
        "但是第二天早上，我突然有了一个想法：有没有可能测试结果不理想，跟是不是使用了真实文本没有关系，而是跟语种有关？因为我们知道，中文是表意语言，而英文是表音语言，对于表意语言而言，单字和少量单字组成的词组，在使用的时候的组合固定性要更高，包含更紧密语义关系的连接也更多。这和Transformer架构的模型工作原理——预测下一个单字也更加契合。有没有可能这个方案在英文场景下不适用，但是能够在以中文为代表的表意文字体系下能够得到好的结果？\n",
        "\n",
        "基于这个想法，我尝试从Agently的案例文档，Google Colab的中文使用说明文档和豆瓣上随便搜索的2篇小说介绍里截取了5句话，使用下面的代码进行了测试，从百度文心4.0模型得到的结果符合预期，GPT的返回结果仍然不太行。因为Agently的案例文档是最近1到2周才撰写的，从预训练的角度看，这些语料不太可能被作为模型训练语料，因此有可能推断，文心4.0在这个中文NLP任务上，展现出了较强的能力。\n",
        "\n",
        "所以，我还是把这个案例展示出来，它可能只能在中文语境中被应用，但也希望能给大家的工作带来一些启发和帮助。"
      ],
      "metadata": {
        "id": "kyLFmv_l-aIx"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 1: Install Packages"
      ],
      "metadata": {
        "id": "nRsfMu4lAJZF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install -U Agently"
      ],
      "metadata": {
        "id": "nsst7pOAANlr"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 2: Demo Code"
      ],
      "metadata": {
        "id": "_-1gryYwASPM"
      }
    },
    {
      "cell_type": "markdown",
      "source": [],
      "metadata": {
        "id": "w_D2P1MhFTy0"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lLaK-w-E-ZKU",
        "outputId": "b6c3c6fe-248e-41d1-f0c8-20d3f399a519"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[Request Data]\n",
            " {\n",
            "    \"messages\": [\n",
            "        {\n",
            "            \"role\": \"user\",\n",
            "            \"content\": \"# [输入]:\\nheaders:\\n- 面功能（如调试和资源利用率监控器）。\\n- 的知识储备和身体素质。同时，任小粟还通过这个手段\\n- 无赖，却又智勇双全。\\n- 的老板、同事又不会写SQL，他们只能找你这个“专家”。\\n- 插件的时候，只有和当前状态相对应的行为指导和样例台词会被传给agent\\ntailers:\\n- 而且总是处理“提数”这件事可能把你深度思考的时间切得粉碎。但你能怎么办呢？你\\n- Docker 映像包含我们的托管运行时环境 (https://colab.research.google.com) 中提供的软件包，并支持某些界\\n- 通过获取他人的正面情绪得到货币值兑换异能，进而增强自己\\n- 要注意，当你使用Status能力\\n- 大哥帝林沉稳移谋，二哥斯特林统军有方，老三紫川秀被人称之为\\n\\n\\n# [处理规则]:\\nStep 1: Pick 1 item from {tailers}\\nStep 2: Pick 1 item from {headers} required that can be connected with the item from {tailers} in Step 1 and make the sentence whole or reasonable\\nStep 3: Only generate the readable and reasonable connection result.\\nStep 4: Redo these steps until you find out all tailer header pairs.\\n\\n# [输出要求]:\\n## TYPE:\\nJSON can be parsed in Python\\n## FORMAT:\\n[\\n\\t\\n\\t{\\n\\t\\t\\\"connect_result\\\": <String>,//connection result using 1 item in {tailers} and 1 item in {headers} and the result is readable or meaningful\\n\\t\\t},,\\n\\t...\\n]\\n\\n\\n[输出]:\\n\"\n",
            "        }\n",
            "    ],\n",
            "    \"stream\": true,\n",
            "    \"model\": \"ernie-bot-4\"\n",
            "}\n",
            "[Realtime Response]\n",
            "\n",
            "```json\n",
            "[\n",
            "  {\n",
            "    \"connect_result\": \"要注意，当你使用Status能力的时候，只有和当前状态相对应的行为指导和样例台词会被传给agent\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"大哥帝林沉稳移谋，二哥斯特林统军有方，老三紫川秀被人称之为无赖，却又智勇双全。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"而且总是处理“提数”这件事可能把你深度思考的时间切得粉碎。但你能怎么办呢？你的老板、同事又不会写SQL，他们只能找你这个“专家”。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"Docker 映像包含我们的托管运行时环境 (https://colab.research.google.com) 中提供的软件包，并支持某些界面功能（如调试和资源利用率监控器）。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"通过获取他人的正面情绪得到货币值兑换异能，进而增强自己的知识储备和身体素质。同时，任小粟还通过这个手段\"\n",
            "  }\n",
            "]\n",
            "```\n",
            "--------------------------\n",
            "\n",
            "[Final Reply]\n",
            " ```json\n",
            "[\n",
            "  {\n",
            "    \"connect_result\": \"要注意，当你使用Status能力的时候，只有和当前状态相对应的行为指导和样例台词会被传给agent\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"大哥帝林沉稳移谋，二哥斯特林统军有方，老三紫川秀被人称之为无赖，却又智勇双全。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"而且总是处理“提数”这件事可能把你深度思考的时间切得粉碎。但你能怎么办呢？你的老板、同事又不会写SQL，他们只能找你这个“专家”。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"Docker 映像包含我们的托管运行时环境 (https://colab.research.google.com) 中提供的软件包，并支持某些界面功能（如调试和资源利用率监控器）。\"\n",
            "  },\n",
            "  {\n",
            "    \"connect_result\": \"通过获取他人的正面情绪得到货币值兑换异能，进而增强自己的知识储备和身体素质。同时，任小粟还通过这个手段\"\n",
            "  }\n",
            "]\n",
            "``` \n",
            "--------------------------\n",
            "\n",
            "[Parse JSON to Dict] Done\n",
            "\n",
            "--------------------------\n",
            "\n",
            "[{'connect_result': '要注意，当你使用Status能力的时候，只有和当前状态相对应的行为指导和样例台词会被传给agent'}, {'connect_result': '大哥帝林沉稳移谋，二哥斯特林统军有方，老三紫川秀被人称之为无赖，却又智勇双全。'}, {'connect_result': '而且总是处理“提数”这件事可能把你深度思考的时间切得粉碎。但你能怎么办呢？你的老板、同事又不会写SQL，他们只能找你这个“专家”。'}, {'connect_result': 'Docker 映像包含我们的托管运行时环境 (https://colab.research.google.com) 中提供的软件包，并支持某些界面功能（如调试和资源利用率监控器）。'}, {'connect_result': '通过获取他人的正面情绪得到货币值兑换异能，进而增强自己的知识储备和身体素质。同时，任小粟还通过这个手段'}]\n"
          ]
        }
      ],
      "source": [
        "import Agently\n",
        "\n",
        "agent_factory = Agently.AgentFactory(is_debug=True)\n",
        "\n",
        "agent = agent_factory.create_agent()\\\n",
        "    .set_settings(\"current_model\", \"ERNIE\")\\\n",
        "    .set_settings(\"model.ERNIE.auth\", { \"aistudio\": \"\" })\n",
        "\n",
        "headers = [\n",
        "    \"面功能（如调试和资源利用率监控器）。\",\n",
        "    \"的知识储备和身体素质。同时，任小粟还通过这个手段\",\n",
        "    \"无赖，却又智勇双全。\",\n",
        "    \"的老板、同事又不会写SQL，他们只能找你这个“专家”。\",\n",
        "    \"插件的时候，只有和当前状态相对应的行为指导和样例台词会被传给agent\",\n",
        "]\n",
        "\n",
        "tailers = [\n",
        "    \"而且总是处理“提数”这件事可能把你深度思考的时间切得粉碎。但你能怎么办呢？你\",\n",
        "    \"Docker 映像包含我们的托管运行时环境 (https://colab.research.google.com) 中提供的软件包，并支持某些界\",\n",
        "    \"通过获取他人的正面情绪得到货币值兑换异能，进而增强自己\",\n",
        "    \"要注意，当你使用Status能力\",\n",
        "    \"大哥帝林沉稳移谋，二哥斯特林统军有方，老三紫川秀被人称之为\"\n",
        "]\n",
        "\n",
        "result = agent\\\n",
        "    .input({\n",
        "        \"headers\": headers,\n",
        "        \"tailers\": tailers,\n",
        "    })\\\n",
        "    .instruct(\n",
        "\"\"\"Step 1: Pick 1 item from {tailers}\n",
        "Step 2: Pick 1 item from {headers} required that can be connected with the item from {tailers} in Step 1 and make the sentence whole or reasonable\n",
        "Step 3: Only generate the readable and reasonable connection result.\n",
        "Step 4: Redo these steps until you find out all tailer header pairs.\"\"\")\\\n",
        "    .output([{\n",
        "        \"connect_result\": (\"String\", \"connection result using 1 item in {tailers} and 1 item in {headers} and the result is readable or meaningful\"),\n",
        "    }])\\\n",
        "    .start()\n",
        "print(result)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n",
        "[**_<font color = \"red\">Agent</font><font color = \"blue\">ly</font>_** Framework - Speed up your AI Agent Native application development](https://github.com/Maplemx/Agently)"
      ],
      "metadata": {
        "id": "b8QQSLDyMMN9"
      }
    }
  ]
}